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Abstract 

Heavy-tailed distributions naturally occur in many real life problems. Unfortu- 
nately, it is typically not possible to compute inference in closed-form in graphical 
models which involve such heavy-tailed distributions. 

In this work, we propose a novel simple linear graphical model for independent 
latent random variables, called linear characteristic model (LCM), defined in the 
characteristic function domain. Using stable distributions, a heavy-tailed family 
of distributions which is a generalization of Cauchy, Levy and Gaussian distri- 
butions, we show for the first time, how to compute both exact and approximate 
inference in such a linear multivariate graphical model. LCMs are not limited to 
stable distributions, in fact LCMs are always defined for any random variables 
(discrete, continuous or a mixture of both). 

We provide a realistic problem from the field of computer networks to demon- 
strate the applicability of our construction. Other potential application is iterative 
decoding of linear channels with non-Gaussian noise. 



1 Introduction 

Heavy-tailed distributions naturally occur in many real life phenomena, for example in computer 
networks [14, 16, 23]. Typically, a small set of machines are responsible for a large fraction of the 
consumed network bandwidth. Equivalently, a small set of users generate a large fraction of the 
network traffic. Another common property of communication networks is that network traffic tends 
to be linear [8, 23]. Linearity is explained by the fact that the total incoming traffic at a node is 
composed from the sum of distinct incoming flows. 

Recently, several works propose to use linear multivariate statistical methods for monitoring net- 
work health, performance analysis or intrusion detection [13-16]. Some of the aspects of network 
traffic makes the task of modeling it using a probabilistic graphical models challenging. In many 
cases, the underlying heavy-tailed distributions are difficult to work with analytically. That is why 
existing solutions in the area of network monitoring involve various approximations of the joint 
probability distribution function using a variety of techniques: mixtures of distributions [8], spectral 
decomposition [13] historgrams [14], sketches [16], entropy [14], sampled moments [23], etc. 

In the current work, we propose a novel linear probabilistic graphical model called linear charac- 
teristic model (LCM) to model linear interactions of independent heavy-tailed random variables 
(Section 3). Using the stable family of distributions (defined in Section 2), a family of heavy-tailed 
distributions, we show how to compute both exact and approximate inference (Section 4). Using 
real data from the domain of computer networks we demonstrate the applicability of our proposed 
methods for computing inference in LCM (Section 5). 

We summarize our contributions below: 
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• We propose a new linear graphical model called LCM, defined as a product of factors in the 
cf domain. We show that our model is well defined for any collection of random variables, 
since any random variable has a matching cf. 

• Computing inference in closed form in linear models involving continuous variables is typ- 
ically limited to the well understood cases of Gaussians and simple regression problems in 
exponential families. In this work, we extend the applicability of belief propagation to the 
stable family of distributions, a generalization of Gaussian, Cauchy and Levy distributions. 
We analyze both exact and approximate inference algorithms, including convergence and 
accuracy of the solution. 

• We demonstrate the applicability of our proposed method, performing inference in real 
settings, using network tomography data obtained from the PlanetLab network. 

1.1 Related work 



There are three main relevant works in the machine learning domain which are related to the current 
work: Convolutional Factor Graphs (CFG), Copulas and Independent Component Analysis (ICA). 
Below we shortly review them and motivate why a new graphical model is needed. 

Convolutional Factor Graphs (CFG) [18, 19] are a graphical model for representing linear relation 
of independent latent random variables. CFG assume that the probability distribution factorizes 
as a convolution of potentials, and proposes to use duality to derive a product factorization in the 
characteristic function (cf) domain. In this work we extend CFG by defining the graphical model as 
a product of factors in the cf domain. Unlike CFGs, LCMs are always defined, for any probability 
distribution, while CFG may are not defined when the inverse Fourier transform does not exist. 

A closely related technique is the Copula method [17, 22]. Similar to our work, Copulas assume a 
linear underlying model. The main difference is that Copulas transform each marginal variable into 
a uniform distribution and perform inference in the cumulative distribution function (cdf) domain. In 
contrast, we perform inference in the cf domain. In our case of interest, when the underlying distri- 
butions are stable, Copulas can not be used since stable distributions are not analytically expressible 
in the cdf domain. 

A third related technique is ICA (independent component analysis) on linear models [27]. Assum- 
ing a linear model Y = AX 1 , where the observations Y are given, the task is to estimate the linear 
relation matrix A, using only the fact that the latent variables X are statistically mutually indepen- 
dent. Both techniques (LCM and ICA) are complementary, since ICA can be used to learn the linear 
model, while LCM is used for computing inference in the learned model. 

2 Stable distribution 



Stable distribution [30] is a family of heavy-tailed distributions, where Cauchy, Levy and Gaussian 
are special instances of this family (see Figure 1). Stable distributions are used in different prob- 
lem domains, including economics, physics, geology and astronomy [24]. Stable distribution are 
useful since they can model heavy-tailed distributions that naturally occur in practice. As we will 
soon show with our networking example, network flows exhibit empirical distribution which can be 
modeled remarkably well by stable distributions. 

We denote a stable distribution by a tuple of four parameters: S(a, (3, 7, 5). We call a as the char- 
acteristic exponent, (3 is the skew parameter, 7 is a scale parameter and 5 is a shift parameter. For 
example (Fig. 1), a Gaussian A/"(/i, <r 2 ) is a stable distribution with the parameters 5(2, 0, n), a 

Cauchy distribution Cauchy(7, 5) is stable with S(l, 0, 7, S) and a Levy distribution Levy(7, 8) is 
stable with <S(i, 1, 7, 6). Following we define formally a stable distribution. We begin by defining 
a unit scale, zero-centered stable random variable. 

Definition 2.1. [25, Def. 1.6] A random variable X is stable if and only ifX ~ aZ + b, < a < 2, 

— 1 < /3 < 1, a, b G M, a ^ and Z is a random variable with characteristic function 2 

1 V >l lexp(-H[l + ^| S ign( U )log(H)]) a = l 



'Linear model is formally defined in Section 3. 

2 We formally define characteristic function in the supplementary material. 
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Next we define a general stable random variable. 

Definition 2.2. [25, Def. 1.7] A random variable X is S(a, (3, 7, 6) if 

x ^ f 7 (Z-0tan(^)) + a a/1 
~ \"/Z + 5 a = 1 ' 



where Z is given by (1). X /ias characteristic function 




l)]+i&u) a=£l 
a = 1 



A basic property of stable laws is that weighted sums of a-stable random variables is a-stable (and 
hence the family is called stable). This property will be useful in the next section where we compute 
inference in a linear graphical model with underlying stable distributions. The following proposition 
formulates this linearity. 

Proposition 2.1. [25, Prop. 1.16] 

a) Multiplication by a scalar. If X ~ S(a, (3, 7, S) then for any a, b e R, a ^ , 

aX + b ~ 5(a, sign(a)/3, |a|7, a5 + 6) . 

fcj Summation of two stable variables, /f Xl ~ 5(a, /3i ,71, #i) anaf X 2 ~ <S(a, /?2, 72, ^2) 
are independent, then X\ + X2 ~ 5(a, /?, 7, 5) where 



Note that both X\ , X2 /iave fo foe distributed with the same characteristic exponent a. 

3 Linear characteristic models 

A drawback of general stable distributions, is that they do not have closed-form equation for the pdf 
or the cdf. This fact makes the handling of stable distributions more difficult. This is probably one 
of the reasons stable distribution are rarely used in the probabilistic graphical models community. 

We propose a novel approach for modeling linear interactions between random variables distributed 
according to stable distributions, using a new linear probabilistic graphical model called LCM. A 
new graphical model is needed, since previous approaches like CFG or the Copula method can not be 
used for computing inference in closed-form in linear models involving stable distribution, because 
they require computation in the pdf or cdf domains respectively. We start by defining a linear model: 

Definition 3.1. (Linear model) Let Xi, ■ ■ ■ , X n a set of mutually independent random variables. 3 
Let Yi, ■ ■ ■ , Y m be a set of observations obtained using the linear model: 



where A\j G R are weighting scalars. We denote the linear model in matrix notation asY — AX. 

Linear models are useful in many domains. For example, in linear channel decoding, X are the 
transmitted codewords, the matrix A is the linear channel transformation and Y is a vector of obser- 
vations. When X are distributed using a Gaussian distribution, the channel model is called AWGN 
(additive white Gaussian noise) channel. Typically, the decoding task is finding the most probable 
X, given A and the observation Y. Despite the fact that X are assumed statistically mutually inde- 
pendent when transmitting, given an observation Y, X are not independent any more, since they 
are correlated via the observation. Besides of the network application we focus on, other potential 
application to our current work is linear channel decoding with stable, non-Gaussian, noise. 

In the rest of this section we develop the foundations for computing inference in a linear model using 
underlying stable distributions. Because stable distributions do not have closed-form equations in 
the pdf domain, we must work in the cf domain. Hence, we define a dual linear model in the cf 
domain. 

3 We do not limit the type of random variables. The variables may be discrete, continuous, or a mixture of 
both. 
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3.1 Duality of LCM and CFG 

CFG [19] have shown that the joint probability p(x, y) of any linear model can be factorized as a 
convolution: 

p{x,y) = p(*i, • ■ • ,x n ,y lr -- ,y m ) = J^Jp^yi, • • • ,y m ) . (2) 

i 

Informally, LCM is the dual representation of (2) in the characteristic function domain. Next, we 
define LCM formally, and establish the duality to the factorization given in (2). 

Definition 3.2. (LCM) Given the linear model Y=AX, we define the linear characteristic model 
(LCM) 

ipitl, ■ ■ ■ ,t n> Sl, • • • , S m ) = Y\ s l, ' ' ' , s m) , 

i 

where ipiti, s%, ■ ■ ■ , s m ) is the characteristic function 4 of the joint distribution p(xt, yt, ■ ■ ■ , y m )- 

The following two theorems establish duality between the LCM and its dual representation in the 
pdf domain. This duality is well known (see for example [18, 19]), but important for explaining the 
derivation of LCM from the linear model. 

Theorem 3.3. Given a LCM, assuming p(x, y) as defined in (2) has a closed form and the Fourier 
transform J-[p(x, y)] exists, then the J-[p(x, y)] — <p(tx, ■ ■ ■ , t n , s±, ■ ■ ■ , s m ). 

Theorem 3.4. Given a LCM, when the inverse Fourier transform exists, then 
.F -1 (y>(*i, • • • ,t n ,si, ■ ■ ■ , s m )) = p(x, y) as defined in (2). 

The proof of all theorems is deferred to the supplementary material. Whenever the inverse Fourier 
transform exists, LCM model has a dual CFG model. In contrast to the CFG model, LCM are always 
defined, even the inverse Fourier transform does not exist. The duality is useful, since it allows us to 
compute inference in either representations, whenever it is more convenient. 

4 Main result: exact and approximate inference in LCM 

This section brings our main result. Typically, exact inference in linear models with continuous 
variables is limited to the well understood cases of Gaussian and simple regression problem in 
exponential families. In this section we extend previous results, to show how to compute inference 
(both exact and approximate) in linear model with underlying stable distributions. 

4.1 Exact inference in LCM 

The inference task typically involves computation of marginal distribution or a conditional distri- 
bution of a probability function. For the rest of the discussion we focus on marginal distribution. 
Marginal distribution of the node x t is typically computed by integrating out all other nodes: 



p(xi\y) ~ J p(x,y) d x \i , 

X\i 

where X \ i is the set of all nodes excluding node i. Unfortunately, when working with stable 
distribution, the above integral is intractable. Instead, we propose to use a dual operation called 
slicing, computed in the cf domain. 

Definition 4.1. (slicing/evaluation) [28, p. 110] 

(a) Joint cf. Given random variables Xi,X% the joint cf w <px 1 ,x 2 {ti-i~b'i) — E[e ltlXl+lt2X2 ]. 

(b) Marginal cf. The marginal cf is derived from the joint cf by fxxifi) = <Px 1 ,x 2 (ti,0)- 
This operation is called slicing or evaluation. We denote the slicing operation as ipx 1 (t{) = 

<fiX 1 ,X 2 {tl,t2) 

- t 2 =0 

The following theorem establishes the fact that marginal distribution can be computed in the cf 
domain, by using the slicing operation. 



4 Defined in the supplementary material. 
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for i 6 |T| 
{ 

Eliminate ti by computing 

<f>m+i(N(ti)) = ]^[ *(^. "I-"- .'») 

*>i eJv(*i) 

Remove <p{tj : s\.--- , s m ) and from LCM. 

Add </> m +i to LCM. 

} 

Finally: If J 7 1 exists, compute 

p(xi) ~ J r ^ 1 (4> final) ■ 




Algorithm 1 : Exact inference in LCM using 
LCM-Elimination. 



Figure 1 : The three special cases of stable 
distribution where closed-form pdf exists. 

Theorem 4.2. Given a LCM, the marginal cfofthe random variable Xi can be computed using 



(3) 



T\i=0 



In case the inverse Fourier transform exists, then the marginal probability of the hidden variable Xi 
is given by p(x t ) ~ J r ~ 1 {^(t l )} . 



Based on the results of Thm. 4.2 we propose an exact inference algorithm, LCM-Elimination, for 
computing the marginal cf (shown in Algorithm 1). We use the notation N(k) as the set of graph 
neighbors of node k, excluding fc 5 . T is the set {ti, • • • , t n }. 

LCM-Elimination is dual to CFG-Elimination algorithm [19]. LCM-Elimination operates in the cf 
domain, by evaluating one variable at a time, and updating the remaining graphical model accord- 
ingly. The order of elimination does not affect correctness (although it may affect efficiency). Once 
the marginal cf (p(ti), is computed, assuming the inverse Fourier transform exists, we can compute 
the desired marginal probability p(xi). 



4.2 Exact inference in stable distributions 



After defining LCM and showing that inference can be computed in the cf domain, we are finally 
ready to show how to compute exact inference in a linear model with underlying stable distributions. 
We assume that all observation nodes Yi are distributed according to a stable distribution. From 
the linearity property of stable distribution, it is clear that the hidden variables Xi are distributed 
according to a stable distribution as well. The following theorem is one of the the novel contributions 
of this work, since as far as we know, no closed-form solution was previously derived. 

Theorem 4.3. Given a LCM, Y = AX + Z, with n i.i.d. hidden variables Xi ~ iS(a, f3 Xi , j Xi ,S Xi ), 
ni.i.d. noise variables with known parameters Zi ^ S(a, f3 Zi ,j Zi , S Zi ), and n observations yi £ R, 
assuming the matrix A nxn is invertible 6 , then 

a) the observations Yi are distributed according to stable distribution Yi ~ S(a, f3 Vi , j Vi , 5 Vi ) with 
the following parameters: 

^ = \A\ a ^ + 7z Q , p y = ly ~ a [(\A\ a sign^l))^ 7,) + i3 z Q 7 ,], S v = A8 X + 

= ftan(^)[/? y 7!/ -A(/3 a; 7 a; )-/3 z 7 z ] a/1 
Kv \^[l3 y Qj y Q\o g { ly )~AQ\og{\A\){l3 x Qj x )-A(l3 x Q-f x Q\og(-f x ))~l3 z Qj z ] a = l ' 

b) the result of exact inference for computing the marginals p(xi\y) ~ S(a, /3 Xi \ y ,J Xi \ y , 5 X( \y) ' s 
given in vector notation: 

^l,=7^©[(l^r0sign(A))- 1 (/3 H 7 H a )], 7"|„ = (HTS?, 8 x{y = A^Sy - (4) 



More detailed explanation of the construction of a graphical model out of the linear relation matrix A is 
found on [4, Chapter 2.3]. 

6 To simplify discussion we assume that the length of both the hidden and observation vectors \X\ = \Y\ — 
n. However the results can be equivalently extended to the more general case where \X\ = n, \Y\ = m,m 7^ 
n. See for example [6]. 
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Initialize: mij(xj) — 1, VAy ^ 0. 
Iterate until convergence 




Initialize: mij(xj) — 1, VAij ^ 0. 
Iterate until convergence 


my (tj) = <Pi(ti, si, ■ ■ ■ , S m ) Yl m ki(ti) 
fc£N(i)\j 




i 4 fceN(i)\j 


Finally : 




Finally : 


ip(U) = <fi(ti, Si, ■ ■ ■ , s m ) Y| m ki (ti) 

(a) teJV(l) 




p(xj) =p(aJi,yi,--- ,2/m.)* P| m ki (xi). 


Initialize: /3 X . | y , 7 X . | y , S x . | y =5(q, 0,0,0), 
Iterate until convergence: 






j^i j^i j^i 

tan(^)[^ yi 7 Hi - Y.j ^j&jIbI.-i, ] a 1 
Cxja = < J l-c l-c (6) 
[iWyilvi log(7»i) - Ej^y^o^wiosd^yD&jIsT^j, -Ej A «^j|yT<. i |»log(7 <f v )] a = 1 


Output: x 4 |a ~ 5(a,0 x .| H /7°.| H ,7 x .| B ,(5 :c .| ! ,) 




(c) 



Algorithm 2: Approximate inference in LCM using the (a) Characteristic-Sum-Product (CSP) algorithm (b) 
Integral Convolution (IC) algorithm. Both are exact on tree topologies, (c) Stable-Jacobi algorithm. 

= ftan(2S)ta„©7 s - J 403 ir | v ©7«| w )] a^l 
^ xlv \||A,©7«0M7«)-(^©log(|A|)(^ a | 8 ©7«|J-4(^«| v 7„| v ©Iog(7.| s ))] q = 1 ' 

(5) 

where is f/;e enfryw/ie product (of both vectors and matrices), \A\ is the absolute value (entrywise) 
log(A), A a , sign(^4) are enfryvw.se matrix operations and /3 X = [/3 Xl , Px n ] T and the same for 

x j &y 7 &z • 



4.3 Approximate Inference in LCM 

Typically, the cost of exact inference may be expensive. For example, in the related linear model of a 
multivariate Gaussian (a special case of stable distribution), LCM-Elimination reduces to Gaussian 
elimination type algorithm with a cost of 0(n 3 ), where n is the number of variables. Approximate 
methods for inference like belief propagation [26], usually require less work than exact inference, 
but may not always converge (or convergence to an unsatisfactory solution). The cost of exact 
inference motivates us to devise a more efficient approximations. 

We propose two novel algorithms that are variants of belief propagation for computing approximate 
inference in LCM. The first, Characteristic-Slice-Product (CSP) is defined in LCM (shown in Algo- 
rithm 2(a)). The second, Integral-Convolution (IC) algorithm (Algorithm 2(b)) is its dual in CFG. 
As in belief propagation, our algorithms are exact on tree graphical models. The following theorem 
establishes this fact. 

Theorem 4.4. Given an LCM with underlying tree topology ( the matrix A is an irreducible adja- 
cency matrix of a tree graph), the CSP and IC algorithms, compute exact inference, resulting in the 
marginal cf and the marginal distribution respectively. 

The basic property which allows us to devise the CSP algorithm is that LCM is defined as a prod- 
uct of factor in the cf domain. Typically, belief propagation algorithms are applied to a probability 
distribution which factors as a product of potentials in the pdf domain. The sum-product algorithm 
uses the distributivity of the integral and product operation to devise efficient recursive evaluation of 
the marginal probability. Equivalently, the Characteristic-Slice-Product algorithm uses the distribu- 
tivity of the slicing and product operations to perform efficient inference to compute the marginal 
cf in the cf domain, as shown in Theorem 4.4. In a similar way, the Integral-Convolution algorithm 
uses distributivity of the integral and convolution operations to perform efficient inference in the pdf 
domain. Note that the original CFG work [18, 19] did not consider approximate inference. Hence 
our proposed approximate inference algorithm further extends the CFG model. 

4.4 Approximate inference for stable distributions 

For the case of stable distributions, we derive an approximation algorithm, Stable-Jacobi (Algo- 
rithm 2(c)), out of the CSP update rules. The algorithm is derived by substituting the convolution 
and multiplication by scalar operations (Prop. 2.1 b,a) into the update rules of the CSP algorithm 
given in Algorithm 2(a). 
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; Levi fit (5.4G48e-Q4, 9.99e-Q1)| 



(a) / 





Figure 2: (a) Distribution of network flows on a typical PlanetLab host is fitted quite well with a Levy dis- 
tribution, (b) The core of the PlanetLab network. 1% of the flows consists of 19% of the total bandwidth, (c) 
Convergence of Stable-Jacobi. 

Like belief propagation, our approximate algorithm Stable-Jacobi is not guaranteed to converge on 
general graphs containing cycles. We have analyzed the evolution dynamics of the update equations 
for Stable-Jacobi and derived sufficient conditions for convergence. Furthermore, we have analyzed 
the accuracy of the approximation. Not surprisingly, the sufficient condition for convergence relates 
to the properties of the linear transformation matrix A. The following theorem is one of the main 
novel contributions of this work. It provides both sufficient condition for convergence of Stable- 
Jacobi as well as closed-form equations for the fixed point. 

Theorem 4.5. Given a LCM with n i.i.d hidden variables Xi, n observations Y% distributed ac- 
cording to stable distribution Yi ~ <S(a, /3 yi , r ) yi , S yi ), assuming the linear relation matrix A nxn is 
invertible and normalized to a unit diagonal 1 , Stable-Jacobi (as given in Algorithm 2(c)) converges 
to a unique fixed point under both the following sufficient conditions for convergence (both should 
hold): 

(1) p(\R\ a )<l, (2) P (R)<1. 

where p{R) is the spectral radius (the largest absolute value of the eigenvalues of R), R = I — A, 
\R\ is the entrywise absolute value and \R\ a is the entrywise exponentiation. Furthermore, the 
unique fixed points of convergence are given by equations (4)-(5). The algorithm converges to the 
exact marginals for the linear-stable channel* 

5 Application: Network flow monitoring 

In this section we propose a novel application for inference in LCMs to model network traffic flows 
of a large operational worldwide testbed. Additional experimental results using synthetic examples 
are found in the supplementary material. Network monitoring is an important problem in monitoring 
and anomaly detection of communication networks [8, 15, 16]. We obtained Netflow PlanetLab net- 
work data [10] collected on 25 January 2010. The PlanetLab network [1] is a distributed networking 
testbed with around 1000 server nodes scattered in about 500 sites around the world. We define a 
network flow as a directed edge between a transmitting and receiving hosts. The number of packets 
transmitted in this flow is the scalar edge weight. 

We propose to use LCMs for modeling distribution of network flows. Figure 2(a) plots a distribution 
of flows, sorted by their bandwidth, on a typical PlanetLab node. Empirically, we found out that 
network flow distribution in a single PlanetLab node are fitted quite well using Levy distribution a 
stable distribution with a = 0.5, (3 = 1. The empirical means are mean(7) « le~ 4 , mean(5) « 1. 
For performing the fitting, we use Mark Veillette's Matlab stable distribution package [31]. 

Using previously proposed techniques utilizing histograms [16] for tracking flow distribution in 
Figure 2(a), we would need to store 40 values (percentage of bandwidth for each source port). 
In contrast, by approximating network flow distribution with stable distributions, we need only 4 

7 When the matrix A is positive definite it is always possible to normalize it to a unit diagonal. The nor- 

l _ l 

malized matrix is D~2 AD~2 where D = diag(^4). Normalizing to a unit diagonal is done to simplify 
convergence analysis (as done for example in [12]) but does not limit the generality of the proposed method. 

8 Note that there is an interesting relation to the walk-summability convergence condition [12] of belief 
propagation in the Gaussian case: p(|-R|) < 1. However, our results are more general since they apply for any 
characteristic exponent < a < 2 and not just for a — 2 as in the Gaussian case. 
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parameters (a, /3, 7, S)\ Thus we dramatically reduce storage requirements. Furthermore, using 
the developed theory in previous sections, we are able to linearly aggregate distribution of flows in 
clusters of nodes. 

We extracted a connected component of traffic flows connecting the core network 652 nodes. We fit- 
ted a stable distribution characterizing flow behavior for each machine. A partition of 376 machines 
as the observed flows Yi (where flow distribution is known). The task is to predict the distribution of 
the unobserved remaining 376 flows Xi, based on the observed traffic flows (entries of Aij). We run 
approximate inference using Stable-Jacobi and compared the results to the exact result computed by 
LCM-Elimination. We emphasize again, that using related techniques (Copula method , CFG, and 
ICA) it is not possible to compute exact inference for the problem at hand. In the supplementary ma- 
terial, we provide a detailed comparison of two previous approximation algorithms: non-parametric 
BP (NBP) and expectation propagation (EP). 

Figure 2(c) plots convergence of the three parameters f3, 7, 6 as a function of iteration number of 
the Stable-Jacobi algorithm. Note that convergence speed is geometric. (p(R) = 0.02 << 1). 
Regarding computation overhead, LCM-Exact algorithm requires 4 • 376 3 operations, while Stable- 
Jacobi converged to an accuracy of le~ 5 in only 4 • 376 2 • 25 operations. Additional benefit of the 
Stable-Jacobi is that it is a distributed algorithm, naturally suitable for communication networks. 
Source code of some of the algorithms presented here can be found on [3]. 

6 Conclusion and future work 

We have presented a novel linear graphical model called LCM, defined in the cf domain. We have 
shown for the first time how to perform exact and approximate inference in a linear multivariate 
graphical model when the underlying distributions are stable. We have discussed an application of 
our construction for computing inference of network flows. 

We have proposed to borrow ideas from belief propagation, for computing efficient inference, based 
on the distributivity property of the slice-product operations and the integral-convolution operations. 
We believe that other problem domains may benefit from this construction, and plan to pursue this 
as a future work. 

We believe there are several exciting directions for extending this work. Other families of distri- 
butions like geometric stable distributions or Wishart can be analyzed in our model. The Fourier 
transform can be replaced with more general kernel transform, creating richer models. 
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7 Supplementary material 



Definition 7.1. Characteristic Function. For a scalar random variable X, the characteristic func- 
tion is defined as the expected value ofe ltx where i is the imaginary unit, and t € R. « the argument 
of the characteristic function: fx(t) — E[e ltx ] = J_ OQ e ltx dFx{x) where Fx{x) is the cumula- 
tive distribution function of X. If a random variable X has a probability density function fx, then 

oo 

the characteristic function is its Fourier transform, (px{t) = J e ltx fx{x)dx. 

— oo 

7.1 Proof of Theorem 3.3 

Proof. 

F(p{x,y)) = ^{\\(p{xi,yx,--- ,y m )) = Y\j r {p{x i ,y 1 ,- ■ ■ ,y m )) = 



Y\ <p{U, si, • •■ , s m ) = <p(ti, ■ • ■ ,t», si, ■■ ■ , s r , 



□ 



7.2 Proof of Theorem 3.4 

Proof. 



T 1 (<p(ti, ■ ■ ■ ,t„, si, ■ ■ ■ , s m )) = T (Y\ Si > ' ' ' ' Sm )) 

i 

= 17 J r ~ 1 (*'(*i 1 Si, ■ • • , S m )) = ' ' ' ,Vm) = P(x,y) . 



□ 



7.3 Proof of Theorem 4.2 



Proof. The proof follows from the Projection-Slice theorem (also known as the Central Slice theo- 
rem) [20, p. 349], which is briefly stated here. Let f(x, y) be a multivariate function and F(u, v) be 
its matching Fourier transform. Then 



oo -/ — oo 



Hf{*)} = H f(x,y)dy} = e mx [ f(x,y)dy]dx = / e mx f(x, y)dxdy = F(u, 0) . 

J — oo J — oo — oo — OO ' 

This theorem is naturally extended to multiple variables. In our case, 



ifi(0,0,U, ■■■ ,0) = Y[<p(tj,si,- ■ ■ ,s„ 



J T\i=0 



e **i*i [J| p ( I;i!/li ... )2/m ) ]d X 



/OO /'OO * /» * 

e tiX *[ Y[p(xj,yi,-- ,y m )d X \i]dxi = F{ / [Y[p(.Xj,yi,- ■ ■ ,ym)]dx\i} = F{p(xi)}. 
- oo ./ — oo J ■ 

3 X\i J 

□ 

7.4 Proof of Theorem 4.3 

Proof. For simplicity, we do not handle the noise variable z in this proof. The noise can be added as a 
regularization later. We use the linear relation between distributions to extract X: X = A~ 1 Y. Note 
that X must distribute according to stable distribution since it is composed from linear combination 
of stable variables. For the scale parameter we get (using the linearity of A substituted in Prop. 2.1 
(a),(b)) 

j 

In vector notation we got 

iy = \A\ a i: . 
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Solving this linear system of equations we get 

i a x \ v = {\A\ a r 1 hy]- 

Regarding the skew parameter j3 x using Prop. 2. l(a,b) we get that 



In vector notation we get 

Pv = ly a © [(sign(^) \A\ a )(J3 x >£)] . 
Now assume that 7° is a known constant, we can exact f3 x and get 

P*\v = My © K(Bign(A) \A\«)- l {fi v 7 Q )] 
Regarding the location parameter S x , 

fiyi = Ajj + i< j 



tan(^)[^ i7w " Ej 8^(^)14*1^7*,] « 7^ 1 

^f[A«7w log(7») - Ej sign^OIA^I^T^ logflAy^)] a = 1 ' 
In matrix notation (after some algebra) we get 

S y = AS X + £ 

rtaa(^)[^ v ©7 y -A(^0 7 x )] a^l 
.f[)S tf 7w©log(7v)-(- A ©log(|A|))(^x0 7»)-AG8»©7x©log(7w))] <*=1 ' 
In total we got a linear system that is solved using 

5«| 1/ =A- 1 (5 v -0- 

□ 

7.5 Proof of Theorem 4.4 

Proof. W.l.g we prove for the Slice-product algorithm 2(a). The other algorithms are symmetric be- 
cause the slice/convolution and integral/convolution operations maintain the distributivity property 
as well. 

We are interested in computing the posterior marginal probability 



p( Xi ) = J p(x,y)dx\i~ J p(xi,--- ,x n ,yi,"' ,ym)d x \i (7) 

x\i X\i 

= J 7 - 1 {T[<p(t i ,s 1 ,--- ,s m )] }. (8) 

W.l.g assume that JQ is a tree root. Its matching marginal cf ip(U) can be written as a combination 
of incoming message computed by the neighboring sub trees: 

ip(ti) ~ (p(U,S!,--- ,s m ) Yl mjifc), 
jeN(i) 

where the messages rriji (tj) are defined by the algorithm 2(a). We prove using full induction on the 
tree diameter. The messages rriji(ti) satisfy the recursion: 

mji(ti) = ip(ti,si,--- ,s m ) TT rn kj {xj) 

keN(j)\i 

The basis for the induction is a tree with a single node x\. In this case there are no incoming 
messages, <p(ti) — tp(ti, Si, • • ■ , s m ) and we are done. Now assume that the induction assumption 
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holds for a tree with diameter d — 1 or less and we want to prove it for a tree with diameter d. We 
make the following construction. We add a new node x.- L to the tree to get a tree with diameter d. 
This node has one or more neighbors j G N(i). 

¥>(*i) ~ f(ti,si, ■ ■ ■ ,s m ) Yl m ii{ t i) = 

j£N(i) 

= (U,S1, ■ • ■ ,S m ) P(tj,s lr -- ,S m ) Y[ m lj{tj)} 

jeN(i) ieN(j)\i 

Using distributivity of the slice/product (algorithm 2(a)), and the tree assumption (separate trees 
connected to node k are disjoint), we interchange order of operators to get: 



t,-=o 



(f(U) ~ Lp(ti,S!,--- ,S m )[Y[(p(tj,Si,-- ,s m ))] 



r 'X\i 



This completes the proof since we have obtained the formulation (8). 



□ 



7.6 Proof of Theorem 4.5 

Proof. We start with the scale parameter calculation since it is decoupled from the other parameters. 

This iteration is a Jacobi iteration for solving the linear system 

\A\ a i« = T y 

The linear system solution is given in (4) as desired. It is further known that this iteration converges 
when p(\R\ a ) < 1. 

Regarding the skew parameter /3 the Stable-Jacobi update rule is: 

fleil* = faitf, - ^2sign(A ij )\A ij \ a l3 Xjly . 

This conforms to the Jacobi equation for solving the linear system 

[\A\ a Q signup, = p y &^ 
Assuming this system converged, we divide by 7" to get (4) 

P*\v = %\y Il^r ©sign^)]- 1 ^ 7,1 ■ 

The iteration for computing a skew parameter j3 converges when p(\R\ a & sign(i2)) < 1. Using [9, 
Theorem 8.4.5, Section 8.4] we get that p(\R a sign(i?)|) = p(|i? Q |) > p(\R\ a sign(i?)). In 
other words, when the sufficient condition for the scale parameter 7 holds (p(|i? Q |) < 1), then the 
skew parameter f3 converges as well. 

Now we analyze the shift parameter S evolution. The parameter is given by 

This is a Jacobi equation for solving the linear system 

A8 X = S y — £ x 

Is given in (4). This iteration converges when p(R) < 1, which is the second sufficient condition for 
convergence. □ 
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7.7 Synthetic example 



We demonstrate the properties Stable-Jacobi, using a small toy example. Experimental settings 
are borrowed from [32]. The linear transformation matrix is a synchronous CDMA channel trans- 

/ 7 -1 3\ 

formation with a cross correlation matrix A3 = j — 1 7 5 . As expected from the conver- 

gence analysis, the sufficient conditions for convergence hold since p(|i?3|) = 0.9008 < 1 and 
p(|i?3| 15 ) = 0.6875 < 1, and indeed the algorithm converges. We initialized x = [1, 1, 1] and 
the additive noise Z x ~ 5(1.5,0, 1,0), Z 2 ~ 5(1.5,0.5, 1,0), ^3 ~ 5(1.5,0,1,0). After comput- 
ing p(y), we computed p(x \ y) using Stable-Jacobi. Regarding convergence dynamics, convergence 
analysis shows that 6 is converging more slowly since it is dependent on both f3 and 7. Figure 3(a) 
shows convergence of message L2 norms. Figure 3(b) plots the Euclidian distance of the intermedi- 
ate solution on each round (as a vector in IR 3 ) to the exact solution computed by LCM-Stable. This 
distance indeed goes to zero as expected. Figure 3(c) shows the same distance but using log plot. 
The almost straight line indicates that the distance is diminishing in a geometric fashion. Unlike the 
global distance which diminishes mono tonic ally, when examining the second entry of the interme- 
diate solution vector X2 (Figure 3(d)), we see the zigzag behavior which is a well known property 
of the Jacobi algorithm [2]. This non-monotonic behavior is demonstrated also when examining the 
Euclidian distance along the single dimension of X2 (Figure 3(e)). 




Figure 3: Evolution dynamics of the Stable-Jacobi algorithm using CDMA correlation matrix 
A3. All three parameters f3, 7, 5 are plotted per round, where the parameter of interest is the scale 
parameter 6 (since i detection is performed by applying the sign operation on it. (a) Convergence 
of message norm (norm of the current posterior relative to the posterior from previous round) when 
j3 0. (b) Distance of marginal to true solution (using regular plot), (c) Distance of the vector of 
marginals for the three users relative to the real transmission (using log plot), (d) Marginal posterior 
for user 2. The location parameter 6 conforms to the binary transmission, (e) Distance between the 
posterior of user 2 to the true solution per iteration. 



7.8 Comparison to previous work 

In this section we compare our exact computation of inference in the linear stable model to pre- 
vious techniques. Since stable distributions do not have a closed-form in the probability domain, 
any solution deployed in the pdf domain must involve approximation. We investigate two existing 
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approximate inference methods as a reference to our newly developed methods. The first algorithm 
we implemented is Non-parametric belief propagation (NBP) [29]. NBP works by approximating 
the observed stable distribution using a Gaussian mixture, and then computes a belief propagation 
procedure using a linear graphical model [5]. The second algorithm is Expectation Propagation 
(EP) variant of belief propagation [21], which approximates each Gaussian mixture message using 
a single Gaussian. 

Algorithm 3 lists the approximation methodology we used. Figure 4 depicts the different steps in- 
volved. The first tree steps prepare the input to the NBP/EP algorithms by converting the distribution 
into a mixture of Gaussians, with a relatively low number of mixture components (to allow for effi- 
cient execution). We construct a linear model graphical as described in [5]. Then we run NBP/EP for 
a predefined number of rounds and output the computed belief. Next we can fit stable distribution 
parameter to the output. 

As well known, a drawback of the NBP algorithm is that the number of mixture components grows 
exponentially when computing the product step of the belief propagation algorithm. To avoid expo- 
nential blowup, efficient reduction methods where developed [7, 11]. However, the efficiency comes 
at the cost of reduced accuracy. 

When working even with small problems (tens of variables) both algorithms did not perform well 
relative to our exact inference method. For example, on a 2D grid graph of 100 hidden nodes and 
100 observation nodes, we got an average scale around 0.8 146 while the average of true hidden scale 
was 1 . The averages of the skew and shift parameters where even worse, since they are dependent 
on the scale parameter. 

We tried to pinpoint to the root causes of reduced accuracy by constructing a small toy example. We 
constructed a small graphical model of two hodden nodes X±, X2 and two observed nodes Yi, Y2. 
The hidden node are initialized using a Cauchy distribution with variance 1. To further simplify 



transformation y = Ax. 

Even for this small problem we can see (Fig. 4(d)) that the NBP output does not match exactly the 
true solution. We believe that the largest error is rooted in the product step approximation. 

Another possible approach is to use Expectation Propagation. EP operates by approximating each 
Gaussian mixture with a single mixture, creating a light weight and faster approximation (relative 
to NBP). Fig. 4(e) shows an EP approximation of a single mixture. For Cauchy distributions, EP 
captured quite well the shape of the distribution (Fig. 4(f)), but less well the exact mean. However, 
for skewed distributions, EP does not capture well the distribution shape, since the distribution shape 
can not be approximated using a single Gaussian (Fig. 4(g)). 



1. Quantisize the stable distribution. 

2. Fit a Gaussian mixture to the quantisized observation using Kernel Ridge Regression. 

3. Optional: reduce the number of mixture components using sampling techniques. 

4. Run non-parametric belief propagation [29] or expectation propagation [21]. 

5. Quantisize the resulting mixture. 

6. Fit a stable distribution to the quantization and retrieve the parameters. 



Overall, we conclude that using previous techniques, it is significantly more difficult to compute 
inference in a linear-stable model and the results obtained are not accurate. In contrast, using our 
developed exact inference procedure the solution is obtained exactly by simply computing three 
matrix inverses. 




Algorithm 3: Approximate inference for linear stable model. 
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(a) Step 1: quantization (b) Step 2: fitting a Gaus- (c) Step 3: Resampling (d) Step 4: Output of 
sian mixture NBP 




(e) Expectation-Propagation (f) Step 4: output of EP (g) Step 4: output of EP 
estimate (Cauchy prior) (Levy) prior 



Figure 4: Approximated inference using previous techniques: non-parametric BP and Expectation 
Propagation. In contrast, Stable-Exact computes inference directly in this model by inverting 3 
matrices. 
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This figure "graph.png" is available in "png" format from: 



http://arxiv.org/ps/1008.5325v4 
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